6 High Dimensional Linear Regression

Continuing from the last model, we consider the model obtained by plugging in all possible $c_i$'s:
$$y_t=\beta_0+\beta_1(t-1)+\beta_2\,\mathrm{ReLU}(t-2)+\cdots+\beta_{n-1}\,\mathrm{ReLU}(t-(n-1))+\varepsilon_t. \tag{1}$$
Note that it differs from our last model:
$$y_t=\beta_0+\beta_1(t-1)+\beta_2\,\mathrm{ReLU}(t-c_1)+\cdots+\beta_{k+1}\,\mathrm{ReLU}(t-c_k)+\varepsilon_t. \tag{2}$$
The new model has no knot parameters $c_k$, so it is linear in its parameters (as discussed earlier). Moreover, (2) is used with a small $k$, while (1) can have a large $n$. In short, (1) is a high-dimensional linear regression model, while (2) is a low-dimensional nonlinear regression model.

Since (1) has ample parameters, it is a flexible model.
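To make the structure concrete, here is a minimal numpy sketch that builds the design matrix of model (1); the helper name `relu_design` is ours, not from the notes:

```python
import numpy as np

def relu_design(n):
    """Design matrix of model (1): columns 1, (t-1), ReLU(t-2), ..., ReLU(t-(n-1))."""
    t = np.arange(1, n + 1)                  # t = 1, ..., n
    cols = [np.ones(n), t - 1.0]             # intercept and linear term
    for j in range(2, n):                    # one ReLU column per kink location j
        cols.append(np.maximum(t - j, 0.0))
    return np.column_stack(cols)             # shape (n, n): as many columns as data points

X = relu_design(5)
print(X.shape)  # (5, 5)
```

Note that the matrix is square: the number of parameters equals the number of observations, which is exactly what makes the model high-dimensional.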

1 Parameter Interpretation in (1)

Let $\mu_t$ denote the deterministic part of (1) (without $\varepsilon_t$):
$$\mu_t=\beta_0+\beta_1(t-1)+\beta_2\,\mathrm{ReLU}(t-2)+\cdots+\beta_{n-1}\,\mathrm{ReLU}(t-(n-1)).$$
Then (1) can be rewritten as
$$y_t=\mu_t+\varepsilon_t,\qquad \varepsilon_t \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2).$$

The parameters can then be interpreted in terms of $\mu_t$.

Plugging in $t=1$ gives $\beta_0=\mu_1$. Plugging in $t=2$ gives $\beta_1=\mu_2-\mu_1$. Similarly, for $t=2,\dots,n-1$,
$$\beta_t=(\mu_{t+1}-\mu_t)-(\mu_t-\mu_{t-1}).$$
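These identities are easy to verify numerically. A small numpy sketch with toy coefficients (our own setup):

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
beta = rng.normal(size=n)                    # toy beta_0, ..., beta_{n-1}

# mu_t = beta_0 + beta_1 (t-1) + sum_{j>=2} beta_j ReLU(t-j), for t = 1, ..., n
t = np.arange(1, n + 1)
mu = beta[0] + beta[1] * (t - 1.0)
for j in range(2, n):
    mu += beta[j] * np.maximum(t - j, 0.0)

assert np.isclose(beta[0], mu[0])            # beta_0 = mu_1
assert np.isclose(beta[1], mu[1] - mu[0])    # beta_1 = mu_2 - mu_1
for s in range(2, n):                        # beta_t = (mu_{t+1}-mu_t) - (mu_t-mu_{t-1})
    assert np.isclose(beta[s], (mu[s] - mu[s - 1]) - (mu[s - 1] - mu[s - 2]))
print("identities verified")
```

(Indices are 0-based in the code, so `mu[s]` is $\mu_{s+1}$.)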

If we use this model to represent the logarithm of a population, then $\beta_0$ is the initial log-level, $\beta_1$ is the growth rate, and each $\beta_t$ ($t\ge 2$) is a change in the growth rate. So the parameters have different scales.

2 Parameter Estimation

2.1 Unregularized MLE

For the linear regression model (1), we can estimate the parameters as usual by MLE, i.e., by minimizing
$$\sum_{t=1}^n\bigl(y_t-\beta_0-\beta_1(t-1)-\beta_2\,\mathrm{ReLU}(t-2)-\cdots-\beta_{n-1}\,\mathrm{ReLU}(t-(n-1))\bigr)^2.$$
We then have $\hat\sigma^2_{\mathrm{MLE}}=\mathrm{RSS}/n$.
However, the number of data points now equals the number of coefficients ($n$ observations, $n$ parameters $\beta_0,\dots,\beta_{n-1}$), so the fit is exact: $\mathrm{RSS}=0$ and $\hat\sigma=0$. The unbiased estimate $\hat\sigma^2=\mathrm{RSS}/(n-p)$ does not exist because $n=p$.
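A quick numpy sketch illustrating this degeneracy: with $p=n$ the design matrix of model (1) is square and invertible, so least squares interpolates the data exactly (toy data, our own construction):

```python
import numpy as np

n = 6
rng = np.random.default_rng(1)
y = rng.normal(size=n)                       # arbitrary toy data

# Square design matrix of model (1): columns 1, (t-1), ReLU(t-2), ..., ReLU(t-(n-1))
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t - 1.0] +
                    [np.maximum(t - j, 0.0) for j in range(2, n)])

beta_hat = np.linalg.solve(X, y)             # exact solve: X is square and invertible
rss = np.sum((y - X @ beta_hat) ** 2)
print(rss)                                   # essentially 0: the model interpolates the data
```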

So traditional estimates overfit the data.

2.2 Regularization

Now we add regularization. The ridge estimator $\hat\beta^{\mathrm{ridge}}(\lambda)$ is the minimizer of
$$\sum_{t=1}^n\Bigl(y_t-\beta_0-\beta_1(t-1)-\sum_{j=2}^{n-1}\beta_j\,\mathrm{ReLU}(t-j)\Bigr)^2+\lambda\sum_{j=2}^{n-1}\beta_j^2.$$
We also have the LASSO estimator $\hat\beta^{\mathrm{lasso}}(\lambda)$, the minimizer of
$$\sum_{t=1}^n\Bigl(y_t-\beta_0-\beta_1(t-1)-\sum_{j=2}^{n-1}\beta_j\,\mathrm{ReLU}(t-j)\Bigr)^2+\lambda\sum_{j=2}^{n-1}|\beta_j|.$$

Correspondingly, plugging in $\mu$, ridge becomes the Hodrick-Prescott filter:
$$\sum_{t=1}^n(y_t-\mu_t)^2+\lambda\sum_{t=2}^{n-1}\bigl[(\mu_{t+1}-\mu_t)-(\mu_t-\mu_{t-1})\bigr]^2,$$
and LASSO becomes the $\ell_1$-trend filter:
$$\sum_{t=1}^n(y_t-\mu_t)^2+\lambda\sum_{t=2}^{n-1}\bigl|(\mu_{t+1}-\mu_t)-(\mu_t-\mu_{t-1})\bigr|.$$
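This correspondence rests on the identity $\beta_t=(\mu_{t+1}-\mu_t)-(\mu_t-\mu_{t-1})$, so the penalty on the coefficients equals the penalty on the second differences of the trend. A short numpy check (toy data, our own variable names):

```python
import numpy as np

n = 10
rng = np.random.default_rng(2)
t = np.arange(1, n + 1)
beta = rng.normal(size=n)                    # toy coefficients of model (1)

# Build mu from model (1)
mu = beta[0] + beta[1] * (t - 1.0)
for j in range(2, n):
    mu += beta[j] * np.maximum(t - j, 0.0)

# Second differences (mu_{t+1}-mu_t)-(mu_t-mu_{t-1}) for t = 2, ..., n-1
second_diff = mu[2:] - 2 * mu[1:-1] + mu[:-2]

# Ridge penalty on beta equals the HP-filter penalty on mu,
# and the LASSO penalty equals the L1-trend-filter penalty
assert np.isclose(np.sum(beta[2:] ** 2), np.sum(second_diff ** 2))
assert np.isclose(np.sum(np.abs(beta[2:])), np.sum(np.abs(second_diff)))
print("penalty equivalence verified")
```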

Fact (Simple Ridge)

For $f(\beta)=(y-\beta)^2+\lambda\beta^2$ with $\lambda>0$, the minimizer is easily found to be $\hat\beta=\dfrac{y}{1+\lambda}$.
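A quick numerical sanity check of this fact, via a numpy grid search with toy values of our own choosing:

```python
import numpy as np

y, lam = 2.0, 0.5
grid = np.linspace(-5, 5, 100001)            # fine grid of candidate beta values
f = (y - grid) ** 2 + lam * grid ** 2        # the simple ridge objective
numerical = grid[np.argmin(f)]               # grid minimizer
closed_form = y / (1 + lam)                  # the claimed minimizer y / (1 + lambda)
print(numerical, closed_form)
assert abs(numerical - closed_form) < 1e-3
```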

We can rewrite the ridge objective in matrix form: $\|y-X\beta\|^2+\lambda\sum_{j=2}^{n-1}\beta_j^2$. Taking $J=\mathrm{diag}(0,0,1,\dots,1)$, the gradient is
$$\nabla\Bigl(\|y-X\beta\|^2+\lambda\sum_{j=2}^{n-1}\beta_j^2\Bigr)=-2X^Ty+2X^TX\beta+2\lambda J\beta.$$
Setting it to $0$, we obtain
$$\hat\beta^{\mathrm{ridge}}(\lambda)=(X^TX+\lambda J)^{-1}X^Ty. \tag{2.1}$$
Compared with the ordinary least squares estimate $\hat\beta=(X^TX)^{-1}X^Ty$, the only difference is the $\lambda J$ term.
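As a sanity check of (2.1), the following numpy sketch (toy design matrix of model (1), our own construction) computes the closed-form ridge estimate and verifies that it satisfies the first-order condition:

```python
import numpy as np

n = 6
rng = np.random.default_rng(3)
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t - 1.0] +
                    [np.maximum(t - j, 0.0) for j in range(2, n)])
y = rng.normal(size=n)

lam = 1.0
J = np.diag([0.0, 0.0] + [1.0] * (n - 2))    # no penalty on beta_0, beta_1
beta_ridge = np.linalg.solve(X.T @ X + lam * J, X.T @ y)   # equation (2.1)

# First-order condition: -2 X^T y + 2 X^T X beta + 2 lambda J beta = 0
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_ridge + 2 * lam * J @ beta_ridge
assert np.allclose(grad, 0.0)
print("first-order condition verified")
```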

For LASSO, consider $f(\beta)=(y-\beta)^2+\lambda|\beta|$.

Fact (Simple LASSO)

The minimizer of $f(\beta)=(y-\beta)^2+\lambda|\beta|$ is given by
$$\hat\beta=\begin{cases}y-\lambda/2,& y>\lambda/2,\\ y+\lambda/2,& y<-\lambda/2,\\ 0,& -\lambda/2\le y\le\lambda/2.\end{cases}$$
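A minimal Python sketch of this soft-thresholding rule (the function name `simple_lasso` is ours), checked against a grid search over the objective:

```python
import numpy as np

def simple_lasso(y, lam):
    """Minimizer of (y - b)^2 + lam * |b|: soft thresholding at lam / 2."""
    if y > lam / 2:
        return y - lam / 2
    if y < -lam / 2:
        return y + lam / 2
    return 0.0                               # |y| <= lam / 2 is shrunk exactly to zero

# Check against a grid search over the objective for several toy y values
lam = 1.0
grid = np.linspace(-5, 5, 100001)
for y in (-2.0, -0.3, 0.0, 0.3, 2.0):
    numerical = grid[np.argmin((y - grid) ** 2 + lam * np.abs(grid))]
    assert abs(simple_lasso(y, lam) - numerical) < 1e-3
print("soft thresholding verified")
```

Note the qualitative difference from ridge: small inputs are set exactly to zero, which is why LASSO produces sparse coefficients.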

3 Cross Validation for Selecting λ

We can use cross validation to select $\lambda$. First split the index set $T=\{1,\dots,n\}$ into $T_{\mathrm{train}}$ and $T_{\mathrm{test}}$ (say 80%/20%). For this split, fit the model on $T_{\mathrm{train}}$: obtain $\hat\beta^{\mathrm{ridge}}_{\mathrm{train}}(\lambda)$ as the minimizer of
$$\sum_{t\in T_{\mathrm{train}}}\bigl(y_t-\beta_0-\beta_1(t-1)-\beta_2\,\mathrm{ReLU}(t-2)-\cdots-\beta_{n-1}\,\mathrm{ReLU}(t-(n-1))\bigr)^2+\lambda\bigl(\beta_2^2+\cdots+\beta_{n-1}^2\bigr),$$
and $\hat\beta^{\mathrm{lasso}}_{\mathrm{train}}(\lambda)$ as the minimizer of
$$\sum_{t\in T_{\mathrm{train}}}\bigl(y_t-\beta_0-\beta_1(t-1)-\beta_2\,\mathrm{ReLU}(t-2)-\cdots-\beta_{n-1}\,\mathrm{ReLU}(t-(n-1))\bigr)^2+\lambda\bigl(|\beta_2|+\cdots+|\beta_{n-1}|\bigr).$$
Using these estimates, predict $y_t$ for $t\in T_{\mathrm{test}}$:
$$\hat y_t^{\mathrm{ridge}}(\lambda)=\hat\beta^{\mathrm{ridge}}_{\mathrm{train},0}(\lambda)+\hat\beta^{\mathrm{ridge}}_{\mathrm{train},1}(\lambda)(t-1)+\hat\beta^{\mathrm{ridge}}_{\mathrm{train},2}(\lambda)\,\mathrm{ReLU}(t-2)+\cdots+\hat\beta^{\mathrm{ridge}}_{\mathrm{train},n-1}(\lambda)\,\mathrm{ReLU}(t-(n-1)),$$
$$\hat y_t^{\mathrm{lasso}}(\lambda)=\hat\beta^{\mathrm{lasso}}_{\mathrm{train},0}(\lambda)+\hat\beta^{\mathrm{lasso}}_{\mathrm{train},1}(\lambda)(t-1)+\hat\beta^{\mathrm{lasso}}_{\mathrm{train},2}(\lambda)\,\mathrm{ReLU}(t-2)+\cdots+\hat\beta^{\mathrm{lasso}}_{\mathrm{train},n-1}(\lambda)\,\mathrm{ReLU}(t-(n-1)).$$
Denote
$$\text{Test-Error}^{\mathrm{ridge}}(\lambda)=\sum_{t\in T_{\mathrm{test}}}\bigl(y_t-\hat y_t^{\mathrm{ridge}}(\lambda)\bigr)^2,\qquad \text{Test-Error}^{\mathrm{lasso}}(\lambda)=\sum_{t\in T_{\mathrm{test}}}\bigl(y_t-\hat y_t^{\mathrm{lasso}}(\lambda)\bigr)^2.$$
Going over all splits, we obtain the total test error:
$$\text{AllSplit-Test-Error}^{\mathrm{ridge}}(\lambda)=\sum_{\text{all splits}}\text{Test-Error}^{\mathrm{ridge}}(\lambda),\qquad \text{AllSplit-Test-Error}^{\mathrm{lasso}}(\lambda)=\sum_{\text{all splits}}\text{Test-Error}^{\mathrm{lasso}}(\lambda).$$
We apply this procedure to a set of candidate $\lambda$ values and choose the $\lambda$ that minimizes the all-split test error.
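Putting the pieces together, here is a minimal numpy sketch of cross-validated selection of $\lambda$ for ridge; the toy data and candidate grid are our own, and the folds are the every-fifth-point splits mentioned in the notes (0-indexed here):

```python
import numpy as np

def relu_design(n):
    """Design matrix of model (1)."""
    t = np.arange(1, n + 1)
    return np.column_stack([np.ones(n), t - 1.0] +
                           [np.maximum(t - j, 0.0) for j in range(2, n)])

def ridge_fit(X, y, lam):
    """Closed-form ridge (2.1), leaving beta_0 and beta_1 unpenalized."""
    p = X.shape[1]
    J = np.diag([0.0, 0.0] + [1.0] * (p - 2))
    return np.linalg.solve(X.T @ X + lam * J, X.T @ y)

# Toy data: a piecewise-linear trend with one kink, plus noise
rng = np.random.default_rng(4)
n = 50
t = np.arange(1, n + 1)
y = 0.2 * t + 0.3 * np.maximum(t - 25, 0.0) + rng.normal(scale=0.5, size=n)
X = relu_design(n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]      # candidate values (toy grid)
errors = []
for lam in lambdas:
    total = 0.0
    for i in range(5):                       # fold i: test indices {5k + i}
        test = np.arange(i, n, 5)
        train = np.setdiff1d(np.arange(n), test)
        beta = ridge_fit(X[train], y[train], lam)
        total += np.sum((y[test] - X[test] @ beta) ** 2)
    errors.append(total)                     # all-split test error for this lambda

best_lam = lambdas[int(np.argmin(errors))]
print("selected lambda:", best_lam)
```

The LASSO version replaces `ridge_fit` with an $\ell_1$-penalized solver (no closed form; e.g., coordinate descent), but the cross-validation loop is identical.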

A common choice of test sets is $T_{\mathrm{test}}=\{5k+i : k\in\mathbb{N}_0\}\cap T$ for $i=1,2,3,4,5$, i.e., five folds each holding out every fifth time point.